Machine Learning Time Series Forecasting (LSTM) on OMRON connect Data
May 2021 ~ OMRON Healthcare Europe
Length: 0.5 mo (at 1.0 FTE)
Programming languages: Python (Pandas, time, datetime, Math, Matplotlib, NumPy, scikit-learn, TensorFlow)
Data: Over 4 million blood pressure measurements registered via OMRON connect by approximately 35 000 users, containing the recorded systolic value, the device used, and the time and date of each measurement
Problem description:
Build a multivariate, multistep, single-output LSTM that predicts the next two weekly averages of systolic measurements of active OMRON connect users.
Approach:
On top of the pre-processing done in the previous project (see the first paragraph of Approach and Results in Big Data Analysis with PySpark on OMRON connect data), the predictor features whose future values would not be known at prediction time, such as the diastolic pressure and the pulse, were removed. Since only about 1% of the users changed their device while using the app, it was assumed that users would keep measuring with the same device; hence, the device-related variables were kept. Then, the measurements of each user were aggregated per week and structured relative to their start date to allow modeling across multiple users. Afterward, the categorical features were one-hot encoded, and the continuous ones were standardized or normalized. A minimal sketch of these steps follows.
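The snippet below illustrates the weekly aggregation and scaling just described. The column names (user_id, measured_at, systolic, device) are assumptions for illustration, not the actual OMRON connect schema:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("measurements.csv", parse_dates=["measured_at"])

# Index each measurement by the number of weeks since the user's first one,
# so that sequences from different users become comparable.
first = df.groupby("user_id")["measured_at"].transform("min")
df["week"] = (df["measured_at"] - first).dt.days // 7

# Weekly average systolic per user; the device is carried along as a
# categorical feature, since users rarely switch devices.
weekly = (df.groupby(["user_id", "week"])
            .agg(systolic=("systolic", "mean"), device=("device", "first"))
            .reset_index())

# One-hot encode the categorical feature and normalize the continuous one.
weekly = pd.get_dummies(weekly, columns=["device"])
scaler = MinMaxScaler()
weekly["systolic"] = scaler.fit_transform(weekly[["systolic"]]).ravel()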
							
Prior to modeling, the data was reshaped into tensors. Then, a sliding window function for creating the lags and the validation dataset was designed; a minimal version is sketched below.
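A minimal sketch of such a function, assuming each user's weekly series is a NumPy array whose first feature is the systolic value; the window sizes are illustrative, not the values used in the project:

import numpy as np

def make_windows(series: np.ndarray, n_lags: int = 8, horizon: int = 2):
    """series: (n_weeks, n_features); the target is the first feature."""
    X, y = [], []
    for t in range(n_lags, len(series) - horizon + 1):
        X.append(series[t - n_lags:t])       # the lagged inputs
        y.append(series[t:t + horizon, 0])   # the next two weekly averages
    return np.array(X), np.array(y)

# Example: 30 weeks of data with 5 features per week.
X, y = make_windows(np.random.rand(30, 5))
print(X.shape, y.shape)   # (21, 8, 5) (21, 2)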
Subsequently, the sequential model was built with four hidden layers, namely two pairs of an LSTM layer with 100 units followed by a dropout layer to prevent overfitting. The simplified architecture of the neural network is shown below.
[Figure: simplified architecture of the LSTM network]
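In Keras terms, the described stack corresponds roughly to the following sketch; the dropout rate and input shape are assumptions for illustration:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

n_lags, n_features = 8, 5      # must match the sliding-window output

model = Sequential([
    LSTM(100, return_sequences=True, input_shape=(n_lags, n_features)),
    Dropout(0.2),
    LSTM(100),                 # the last LSTM returns only its final state
    Dropout(0.2),
    Dense(2),                  # two-step-ahead, single-variable output
])
model.compile(optimizer="adam", loss="mse")
model.summary()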
Finally, the main hyperparameters of the model, namely the learning rate, optimizer, number of batches, epochs, and steps per epoch, were tuned using grid search with time-series cross-validation, along the lines of the sketch below.
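A sketch of that tuning loop, assuming the X, y arrays produced by the sliding-window function above and an illustrative grid over learning rate and batch size:

import numpy as np
from itertools import product
from sklearn.model_selection import TimeSeriesSplit
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam

def build_model():
    # A fresh copy of the architecture sketched above.
    return Sequential([
        LSTM(100, return_sequences=True, input_shape=(8, 5)),
        Dropout(0.2),
        LSTM(100),
        Dropout(0.2),
        Dense(2),
    ])

param_grid = {"learning_rate": [1e-3, 1e-4], "batch_size": [32, 64]}
tscv = TimeSeriesSplit(n_splits=3)   # folds preserve temporal order

best_score, best_params = np.inf, None
for lr, bs in product(param_grid["learning_rate"], param_grid["batch_size"]):
    fold_scores = []
    for train_idx, val_idx in tscv.split(X):
        m = build_model()
        m.compile(optimizer=Adam(learning_rate=lr), loss="mse")
        m.fit(X[train_idx], y[train_idx], batch_size=bs, epochs=20, verbose=0)
        fold_scores.append(m.evaluate(X[val_idx], y[val_idx], verbose=0))
    score = np.mean(fold_scores)
    if score < best_score:
        best_score, best_params = score, (lr, bs)
print("best:", best_params, "val MSE:", best_score)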
Results:
Before the predictions could be compared against the held-out test set, they were scaled back to their original magnitude. As a baseline, a model that naively estimates the next two weekly averages of each user as the last one registered was implemented. Its RMSE on the test data was 7.31, while the LSTM model scored an RMSE of 4.97; the baseline's error was thus about 47% higher than the LSTM's. A sketch of this evaluation follows.
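A minimal sketch of the comparison, assuming the fitted scaler and model from above, held-out arrays X_test and y_test, and a hypothetical last_observed array holding each user's last registered weekly average:

import numpy as np
from sklearn.metrics import mean_squared_error

# Scale predictions and targets back to their original magnitude.
pred = scaler.inverse_transform(model.predict(X_test).reshape(-1, 1)).reshape(-1, 2)
true = scaler.inverse_transform(y_test.reshape(-1, 1)).reshape(-1, 2)

# Naive baseline: repeat each user's last observed weekly average twice.
naive = np.repeat(last_observed.reshape(-1, 1), 2, axis=1)

rmse_lstm = np.sqrt(mean_squared_error(true, pred))
rmse_naive = np.sqrt(mean_squared_error(true, naive))
print(f"LSTM RMSE: {rmse_lstm:.2f}, naive RMSE: {rmse_naive:.2f}")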
